Abstract: Current sensor collection capabilities produce an incredible
amount of data that needs to be processed, analyzed, and distributed in a
timely and efficient manner. Information Management (IM) services
supporting a publish-subscribe and query paradigm can be a powerful
general-purpose approach to enabling this information exchange between
decoupled and dynamic information producers and consumers. These IM
services will only be of value, however, if they can support operations in
a manner that is responsive to the sheer quantity and frequency of data
produced by surveillance platforms. Cloud computing is the technology of
choice for providing the resources and services needed to enable and
manage large-scale distributed computation. To date, there has been little
work to develop highly scalable, dynamic IM processing and dissemination
services in a cloud computing environment. In this paper we discuss our
design, implementation and evaluation of a prototype cloud-based
information broker, which is a critical component of a highly scalable,
distributed IM system. The brokering prototype is designed using a
distributed stream processing framework and is shown to scale nearly
linearly with the number of computing nodes as information load and
subscription quantity increase.
I. Introduction

Real-time and near-real-time information sharing continues to increase in
importance for many military, security and intelligence domains. This has
led to challenges in information management that have been exacerbated by
the proliferation of increasingly capable sensors. The volume of available
sensed information and the consumers of this information have led to the
following two challenges:

• How to distribute large amounts and high rates of sensor information to
the proper consumers without burying the consumers in extraneous
information.

• How to process and filter content-rich information destined for
consumers at scale and high speeds.

Publish-subscribe brokering has emerged to solve the former of these
challenges. Brokering is the task of matching incoming published data
against a set of subscriptions, which is the core functionality of
publish-subscribe systems. Within such systems, consumers register
subscriptions and publishers, such as sensors, publish data. The
information broker matches published information with subscriptions, and
forwards the published information to subscribed consumers.
In recent years a canonical problem in information brokering has been to
develop approaches that can scale to handle large volumes of information
generated by sensor networks. Unfortunately, up until now, there have been
few ways to scale key information brokering capabilities sufficiently to
support these volumes. Information brokering can be computationally
expensive, and its cost can increase non-linearly with the number of
subscribers and the complexity of their subscriptions. Additionally,
information brokering needs to handle large amounts of aperiodic and
periodic input data of heterogeneous sizes and data formats in real time
or near-real time.

Cloud computing has emerged as a powerful medium for affordable
large-scale computing which could be used to address the brokering
problem. Cloud computing provides infrastructure access, software
licensing, training and maintenance bundled into large commodity computing
data centers that support elastic resource allocation. Cloud computing
could help enable broader distributed information interactions built upon
the publish-subscribe model with real-time and network-centric operational
requirements. To date, however, little has been done to support
publish-subscribe in cloud environments, despite the advantages of
scalable brokering at high speed, larger numbers of clients, and more
complex filtering and brokering. In this paper we describe our design,
development, experimentation and analysis of SCIMITAR, a cloud-based
information brokering capability implemented using the Storm Stream
Processing Framework [16].

The remainder of this document is organized as follows. In Section II we
discuss the needs for high-performance brokering. In Section III we
discuss stream processing technologies which we use as the basis for our
SCIMITAR prototype. In Section IV we present our overall design and
implementation details for our information brokering concept. In Section V
we discuss our experimental evaluation of our cloud-based brokering
prototype. In Section VI we discuss conclusions and possible future work.

Report documentation (from Standard Form 298): SCIMITAR: Scalable
Stream-Processing for Sensor Information Brokering. Raytheon BBN
Technologies, 10 Moulton Street, Cambridge, MA 02139. Military
Communications Conference (MILCOM), November 18-20, 2013, San Diego, CA,
pp. 1856-1861. Report date: November 2013. Approved for public release;
distribution unlimited.

II. The Information Brokering Challenge

Eugster et al. [7] provide an overview of the pub-sub interaction pattern,
highlighting the decoupled nature of publishers and subscribers in time,
space, and synchronization. There have been multiple prior approaches to
publish-subscribe systems, including the OMG Data-Distribution Service
(DDS) Specification [19] and JMS [20]. For the common pub-sub brokering
capabilities we are attempting to improve upon, consumers register
subscriptions based on attributes (i.e., metadata) about the information
and its contents. Published information objects are matched by the broker
against the registered subscription predicates. The broker tags each
information object (IO) with the endpoints to which it is to be delivered.
The brokered object is then passed to a dissemination service which
transmits the object to its subscribers. Each consumer therefore receives
only those information objects that match the consumer’s interests. We
utilize explicit submission and dissemination services for receiving
published information and transmitting brokered information to
subscribers.
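The matching and tagging flow described above can be illustrated with a minimal sketch. It assumes predicates are plain Python callables over a metadata dictionary; the class and function names (`InformationObject`, `Subscription`, `broker`, `disseminate`) are hypothetical and are not taken from the SCIMITAR implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class InformationObject:
    """Hypothetical IO: metadata attributes plus an opaque payload."""
    uid: str
    metadata: Dict[str, str]
    payload: bytes = b""
    endpoints: List[str] = field(default_factory=list)


@dataclass
class Subscription:
    """Hypothetical subscription: a consumer endpoint and a predicate."""
    endpoint: str
    predicate: Callable[[Dict[str, str]], bool]


def broker(io: InformationObject, subscriptions: List[Subscription]) -> InformationObject:
    """Tag the IO with the endpoint of every subscription whose predicate matches."""
    for sub in subscriptions:
        if sub.predicate(io.metadata) and sub.endpoint not in io.endpoints:
            io.endpoints.append(sub.endpoint)
    return io


def disseminate(io: InformationObject) -> List[Tuple[str, str]]:
    """Stand-in for a dissemination service: one (endpoint, IO uid) delivery per tag."""
    return [(endpoint, io.uid) for endpoint in io.endpoints]
```

In this sketch a consumer that registered no matching predicate never appears in the delivery list, which is the filtering property the pub-sub model relies on.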

We focus on supporting published information represented as an
Information Object (IO) consisting of XML metadata and a payload. This
representation is fairly standard for information management systems. The
metadata contains details such as publisher ID, timestamps, source
location, payload format, and attributes over which interest can be
registered. The form of the payload will be dictated by the publisher; for
instance, cameras would likely push out binary imagery data in a payload.

The subscriptions can be complex, and the brokering services we focus on
support brokering that can match over rich sets of metadata using the
XPath language to represent subscription expressions [21], which are to be
matched against the metadata of published IOs. XPath is a rich expression
language and can be computationally expensive to evaluate. Computational
cost can grow non-linearly with the number of predicates in an XPath
expression, and metadata on published data may be very large.
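As an illustration of XPath-style subscription matching, the following sketch uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The metadata layout, element names, and endpoint names are invented for the example; the source does not state which XPath engine or metadata schema SCIMITAR uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical IO metadata; the schema is an assumption for illustration only.
METADATA = """
<metadata>
  <publisher id="cam-3"/>
  <source><region>north</region></source>
  <payload format="jpeg"/>
</metadata>
"""


def matches(metadata_xml: str, xpath: str) -> bool:
    """Return True if the (limited-subset) XPath expression selects any element."""
    root = ET.fromstring(metadata_xml)
    return root.find(xpath) is not None


# Hypothetical subscriptions: consumer endpoint -> XPath predicate.
subscriptions = {
    "analyst-1": "./payload[@format='jpeg']",
    "analyst-2": "./source[region='south']",
}

matched = sorted(ep for ep, xp in subscriptions.items() if matches(METADATA, xp))
```

Note that this sketch re-parses the metadata for every subscription; a broker evaluating many expressions per IO would presumably parse once and reuse the tree.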

A key difference between our information brokering capability and many
others is support for richer “matching” of brokered information with
subscriptions. DDS, for example, utilizes a topic-based subscription model
in which published data contains a topic field and subscribers who have
registered for that specific topic will receive the data. This topic-based
subscription model is a coarse division of all information into logical
groups based on a shared topic name.

Challenges in supporting a scalable brokering service in a cloud computing
environment at large scales include that the service must be able to (1)
scale to the throughput of published data objects, (2) scale with the
quantity and complexity of registered subscription predicates and
published metadata, and (3) maintain up-to-date and consistent views of
registered subscriptions. These challenges are related to the number of
publisher and subscriber clients. An individual publisher may create much
more data than other publishers, and similarly any individual subscriber
may register more subscriptions, or more complex subscriptions, than
another subscriber.

There has been surprisingly little related work in the application of
cloud computing technologies to the problem of information brokering.
Apache Kafka [5] is a distributed and scalable commit log service which
provides brokering capabilities using a topic-based subscription model.
Prior research in scalable brokering has focused on challenges of
geographically distributed multicast [6] [7]. G-QoSM [2] describes a
mechanism for QoS management in a grid computing architecture by reserving
some of the system capacity for utilization by certain classes of
operations if there is resource failure or congestion. This elasticity is
similar to the benefits we provide in our SCIMITAR approach, but through a
different underlying brokering mechanism that accounts more explicitly for
low-level resource allocation. A key benefit of our SCIMITAR approach is
that it is easier to use than these prior approaches and better fits the
elastic resource allocation model encouraged by cloud providers. Kang et
al. provide an approach for resource provisioning in stream-based services
but without publish-subscribe capabilities [13]. Tudoran et al. describe
an approach to streaming data analysis, but without publish-subscribe
capabilities [17]. Zhu et al. [22] provide an approach to cloud-based
publish-subscribe that does not provide the breadth of capabilities
provided in SCIMITAR.

III.  Cloud  Computing  and  Stream  Processing 

When designing a cloud-based brokering capability, we considered the
design of a distributed computing architecture to be the largest
challenge. This architecture needs to support the desirable features of
cloud computing, including horizontal scalability and resilience to
individual computing node failures. Horizontal scalability is the ability
of a system to provide increased performance as more computational
resources become available. The distributed computing architecture needs
to enable low latency and “exactly-once” delivery of published information
to subscribers as required for practical pub-sub middleware. By
“exactly-once” delivery we mean that the brokering capability should not
deliver replicated information to a subscriber.

In  order  to  support  these  requirements  we  focus  our  design 
and  development  on  stream  processing  based  frameworks. 

Based on a cursory review of the current cloud computing distributed
computing paradigms, one might consider using any of the highly scalable
batched MapReduce technologies as, for example, implemented in Hadoop
[10]. Although extremely scalable for information processing, this
approach cannot provide a scalable, low-latency approach to information
brokering. Hadoop needs to register information in the Hadoop NameNode
service, and then read from disk for any brokering function that could be
supported by Hadoop. Whereas successful uses of batched MapReduce support
iterative estimation of parameters for information pulled from disk, our
needs for brokering on IOs already stored in memory do not align well with
Hadoop.

We utilize an alternative and increasingly popular paradigm called stream
processing to develop a cloud-based information brokering prototype. In
the stream processing framework, “flows” of data are ingested and the
stream processing framework uses parallel computation to process this data
across multiple compute nodes.

Until recently, most stream computing frameworks have been proprietary,
such as IBM’s Infosphere Streams [18]. Recently emerging open-source
stream processing frameworks include Apache S4 [15] and Storm [16]. These
two open-source streaming frameworks provide near-real-time response to
input data while also enabling horizontal scalability. Storm provides
fault-tolerance guarantees on data processing despite occurrences of
failures of individual computing nodes. Storm is designed to restart
failed computation in response to partial failures by maintaining limited
state information. We selected Storm as the basis for our SCIMITAR
brokering capability due to Storm’s general computing model and support
for guaranteed processing of published information.

In the Storm framework, streams of data are unbounded lists of tuples.
Spouts represent the data sources that emit possibly multiple streams.
Bolts perform the computation to take streams and convert them to output
streams. A Storm topology is a network of bolts connected by flows of
streams and streams emanating from possibly multiple spouts.
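To make the spout, bolt, and topology vocabulary concrete, the following is a toy, single-process analogy written with Python generators. It is not Storm's API: real Storm topologies run distributed across many tasks, and every name here (`sensor_spout`, `bolt`, `threshold`, `label`, `run_topology`) is hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple

Reading = Tuple[str, int]


def sensor_spout(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Spout analogue: a source that emits a stream of tuples."""
    for reading in readings:
        yield reading


def bolt(transform: Callable[[tuple], Iterable[tuple]]) -> Callable[[Iterable[tuple]], Iterator[tuple]]:
    """Bolt analogue: consume a stream and emit zero or more tuples per input tuple."""
    def run(stream: Iterable[tuple]) -> Iterator[tuple]:
        for tup in stream:
            yield from transform(tup)
    return run


def threshold(tup: tuple) -> Iterable[tuple]:
    """Filtering step: pass only readings at or above 50 (arbitrary cutoff)."""
    if tup[1] >= 50:
        yield tup


def label(tup: tuple) -> Iterable[tuple]:
    """Enrichment step: append a tag field to each tuple."""
    yield tup + ("alert",)


def run_topology(readings: Iterable[Reading]) -> list:
    """Topology analogue: spout -> threshold bolt -> label bolt."""
    return list(bolt(label)(bolt(threshold)(sensor_spout(readings))))
```

The chained generators mirror how a stream leaving one bolt becomes the input stream of the next.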


IV. Design and Implementation

To scale with data throughput, subscriptions must be replicated across
multiple broker processes in a manner that allows more brokers to be
spawned for higher throughput situations and each published IO to be
dispatched to one of these brokers. To scale with the number of
subscriptions, subscriptions must be distributed across multiple broker
processes, and each published IO must be passed to each of these brokers
to ensure that it is compared against every active subscription. These two
situations are illustrated in Figure 1.

To provide this functionality, SCIMITAR distributes subscriptions across
multiple subscription groups. Each subscription group is responsible for
matching published information against an exclusive subset of all
subscriptions. Within each subscription group is a set of replicated
brokering processes that compare published information against the
subscriptions assigned to their group. By passing each published IO to a
single brokering process within every subscription group, each IO is
matched against every active subscription.

All output from the subscription groups is sent to a set of parallel
filters which remove duplicates and forward brokered IOs to the
dissemination service. The specific filtering process used is selected
based on a hash of each IO’s unique ID, ensuring the same filter sees all
of the endpoints for a given IO. Additionally, this processing layer
caches which consumer endpoints an IO is being delivered to and filters
out any duplicates.

This design allows us to meet our two scalability goals: 1) increasing the
number of brokering processes within each subscription group to scale with
increased information throughput and 2) increasing the number of
subscription groups to scale with a larger quantity of subscriptions or
more computationally complex subscriptions.

To support our third design goal of maintaining up-to-date and consistent
views of registered subscriptions we implemented a persistent, globally
accessible database store. Subscription modification processes listen for
commands from clients to add/remove/update subscriptions and then update
the subscription store and send update commands to the appropriate
subscription group. As brokering processes within a subscription group
start, they query the Subscription Store with their Group ID to receive
all current subscriptions.

Figure 1 Parallelization of Brokering. Panel annotations: the Information
Brokering Service can be replicated to handle increases in IO rate; a
large number of subscriptions can be distributed across multiple brokers.
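The subscription-group design can be sketched in a few lines. This is a simplified, single-process model under stated assumptions: subscriptions are partitioned by sorted name, and each group picks a replica round-robin (the source does not specify how a replica is chosen). All names are hypothetical.

```python
import itertools
from typing import Callable, Dict, List

Predicate = Callable[[Dict[str, str]], bool]


class SubscriptionGroup:
    """One group: an exclusive subset of subscriptions, replicated across brokers."""

    def __init__(self, group_id: int, subscriptions: Dict[str, Predicate], replicas: int):
        self.group_id = group_id
        self.subscriptions = subscriptions      # every replica holds this same subset
        self.replica_load = [0] * replicas      # IOs handled by each replica
        self._next_replica = itertools.cycle(range(replicas))

    def broker(self, metadata: Dict[str, str]) -> List[str]:
        """Dispatch the IO to exactly one replica (round-robin is an assumption)."""
        self.replica_load[next(self._next_replica)] += 1
        return [ep for ep, pred in self.subscriptions.items() if pred(metadata)]


def partition(subscriptions: Dict[str, Predicate], n_groups: int, replicas: int) -> List[SubscriptionGroup]:
    """Split subscriptions into disjoint groups; more groups means fewer per broker."""
    names = sorted(subscriptions)
    return [
        SubscriptionGroup(g, {n: subscriptions[n] for n in names[g::n_groups]}, replicas)
        for g in range(n_groups)
    ]


def broker_io(groups: List[SubscriptionGroup], metadata: Dict[str, str]) -> List[str]:
    """Send the IO to one broker in every group so all subscriptions are checked."""
    endpoints: List[str] = []
    for group in groups:
        endpoints.extend(group.broker(metadata))
    return sorted(endpoints)
```

Raising `replicas` spreads IOs over more brokers within a group (throughput scaling), while raising `n_groups` shrinks the subscription set each broker iterates over (subscription scaling).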


In the context of the Storm framework, we designed Broker Bolts which
operate as the brokers. Internally, the brokering bolt iterates over
subscriptions for each IO received and immediately outputs matches. The
output of the broker bolts is fed to the Filter/Output Bolts. The filters
contain a TimeCacheMap of IO UID to List of Endpoints. A Field Grouping
uses a specific Field of the tuple to identify which downstream task that
tuple should be assigned to. In other words, the Field Grouping on IO UID
means every tuple where IO UID is set to “A” will be delivered to the same
task. Field Grouping on IO UID therefore ensures that the same worker gets
each endpoint for a given UID, which reduces the likelihood of a
subscriber receiving duplicate IOs, but there are still cases where
duplicates might be delivered. If a worker dies then a different worker
may be assigned that UID and duplicates may be delivered. Also, in our
design, if the processing of an IO takes too long then its entries may
have been removed from the cache and duplicates may be delivered.
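The duplicate-filtering behavior, including the cache-expiry failure mode noted above, can be sketched as follows. This is an illustration only: the hash function, the per-entry expiry policy, and the names `field_grouping` and `DedupFilter` are assumptions, and Storm's own grouping and TimeCacheMap internals differ in detail.

```python
import time
import zlib
from typing import Callable, Dict, Set, Tuple


def field_grouping(io_uid: str, n_tasks: int) -> int:
    """Route every tuple with the same IO UID to the same task index."""
    return zlib.crc32(io_uid.encode("utf-8")) % n_tasks


class DedupFilter:
    """Time-expiring cache of IO UID -> endpoints already delivered."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen: Dict[str, Tuple[float, Set[str]]] = {}

    def _expire(self, now: float) -> None:
        stale = [uid for uid, (first_seen, _) in self.seen.items() if now - first_seen > self.ttl]
        for uid in stale:
            del self.seen[uid]

    def admit(self, io_uid: str, endpoint: str) -> bool:
        """Return True if this (IO, endpoint) delivery should be forwarded."""
        now = self.clock()
        self._expire(now)
        _, endpoints = self.seen.setdefault(io_uid, (now, set()))
        if endpoint in endpoints:
            return False          # duplicate within the cache window
        endpoints.add(endpoint)
        return True
```

Once an entry has expired, a late-arriving duplicate is admitted again, which corresponds to the case where processing of an IO takes longer than the cache lifetime.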

We developed a Subscription Modification Bolt to manage subscription
updating. The Subscription Modification Bolt communicates with the
Subscription Store, which is implemented using a PostgreSQL DB. This
database stores the details of subscriptions, as well as to which
subscription group each is assigned. A schematic of this is shown in
Figure 3, where specialized subscription modification spouts in Storm
receive subscription update tasks which are routed to the appropriate
subscription groups and subsequently to brokers.

Input into the system is provided via a distributed buffer implemented
using Kestrel [14]. The IO spout polls the buffer for published IOs, while
the Subscription Modification spout polls a second set of buffers for
commands to add/remove/update subscriptions. The output bolt places the
brokered IO into a similar output buffer, which test clients continuously
poll. A schematic of this architecture is shown in Figure 2. In a deployed
system, a dissemination service would perform this task and then deliver
the IO to the correct endpoints.


V.  Experimental  Evaluation 

To evaluate the performance of our cloud-based brokering prototype, we
conducted experiments that evaluated correctness and scalability with
respect to throughput as a function of the number of Storm Supervisor
nodes and subscriptions. We evaluate scalability in terms of the rate of
published IOs and the number of registered subscriptions. We performed two
sets of scalability experiments:




Figure  3  Storm  Topology 


1. IO Input Rate Scalability, to measure how IO brokering throughput
changes as a function of IO input rate and the number of compute nodes
used for brokering.

2. IO Subscription Scalability, to measure how IO brokering throughput
changes as a function of the number of registered subscriptions and
compute nodes used for brokering.


Additionally,  we  have  results  from  experiments  indicating  that 
subscription  groups  are  a  feasible  way  to  minimize  latency. 


A.  Experimentation  Environment 

We developed a custom deployment infrastructure and a test framework to
support experimentation on our cloud-based brokering services. The basis
of our deployment framework is the capability to create, populate,
configure, and run the VMs necessary for our cloud-based brokering
capability. This framework maps the requirements of our prototype cloud
services to “roles,” each containing requirements for software and a VM
and each containing instructions on how to configure the software and
system to interact with the other nodes in the prototype services cloud.
We wrote this automated deployment infrastructure with the boto API [1].
Boto provides an interface to several Infrastructure as a Service (IaaS)
cloud frameworks including Amazon Web Services and Eucalyptus. For load
testing, we used The Grinder [9], a Java load testing framework that
facilitates running a distributed test using many load injector machines.


Our experimental framework consists of virtual machines providing the
following roles: publisher and subscription clients running Grinder
agents, Kestrel queues serving as input and output buffers, nodes running
ZooKeeper and a node running the Storm Nimbus service to coordinate the
Storm topology, a node serving as a Grinder console to coordinate the test
clients, a node running PostgreSQL to provide persistent subscription
storage, and nodes running Storm Supervisors, which we refer to as worker
nodes and which were responsible for the brokering computation.

Figure 2 SCIMITAR Architecture

B. Experiment Execution to Investigate Scalability from adding more Worker Nodes for Brokering

After deployment, we ran an initial experiment to collect data on how IO
throughput performance varies with the number of brokering nodes while
supporting either 64 or 128 subscriptions. We ran this initial set of
experiments to verify that the effect of scaling the number of nodes would
be consistent across different numbers of registered subscriptions. Figure
4 shows the graph from one of the 128-subscription runs, the one with 32
nodes. For these experiments the published IOs contained 1 KB payloads and
metadata which contained less than 20 lines of XML.

Table 1 presents the collected data, which relates the maximum brokering
throughput in IOs per second to the number of brokering worker nodes and
subscriptions. Given the variability in this measurement over time, we
report the data to two significant digits. This data is also shown as a
graph in Figure 4.

As can be seen from Figure 4, throughput appears to grow linearly with the
number of worker nodes used for brokering. This is the behavior we
expected and is highly desirable as it provides a simple approach to scale
our brokering prototype to deal with high throughput by adding more
brokering compute nodes. This linear scaling shows that we can deal with
increasingly large throughput requirements by adding extra computing nodes
as brokering workers.

We used the standard linear regression technique to estimate the slope of
the lines which we also plot in Figure 4. The slope of the 64-subscription
throughput best-fit line (56 IOs per sec. per node) is almost double the
slope of the 128-subscription throughput best-fit line (32 IOs per sec.
per node). These measurements indicate that the brokering throughput may
be close to inversely proportional to the number of subscriptions.
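The slope estimate can be reproduced approximately from the Table 1 values with an ordinary least-squares fit. The sketch below assumes a fit with a free intercept, which the text does not state explicitly; the function and variable names are ours.

```python
from typing import List, Tuple


def least_squares(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Values transcribed from Table 1 (IOs per second).
NODES = [4, 8, 16, 32, 64]
THROUGHPUT_64_SUBS = [200, 430, 890, 1600, 3600]
THROUGHPUT_128_SUBS = [100, 250, 490, 1000, 2000]

slope_64, intercept_64 = least_squares(NODES, THROUGHPUT_64_SUBS)
slope_128, intercept_128 = least_squares(NODES, THROUGHPUT_128_SUBS)
```

Under this assumption the fitted slopes come out at roughly 56 and 31.5 IOs per second per node, in line with the reported values given two significant digits of input data.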
Figure 4. IO Throughput with Best-fit Lines

Table 1. IO throughput (IOs per second) as a function of the number of
brokering compute nodes for two different numbers of registered
subscriptions.

Number of Brokering Nodes    64 Subscriptions    128 Subscriptions
4                            200                 100
8                            430                 250
16                           890                 490
32                           1600                1000
64                           3600                2000

This near-inverse-proportional behavior is expected based on our design in
that it intuitively takes longer to broker over more subscriptions. We
investigate the inverse proportional relationship further in an experiment
set below, and we provide a more detailed analysis of the relationship
between throughput and the number of subscriptions.

We did not explore the far upper limits of the linear scalability of IO
throughput with respect to the number of worker nodes used for brokering.
Our hypothesis is that there are limits to how far our experimental
cloud-based brokering framework can scale based either on the underlying
infrastructure, inter-node communication bottlenecks, or the stream
processing framework. An exploration of the upper limits of the
scalability of our framework is an area of ongoing and future work.

Although we did not analyze how scalability varies based on IO size, our
expectation is that IO size will not greatly affect scalability, but there
is likely to be an upper limit on the capability of our experimental
framework to handle very large multi-gigabyte IOs at very high
throughputs.

Figure 5. IO Throughput as a Function of Number of Subscriptions.

Table 2. IO throughput as a function of the number of registered
subscriptions for two different numbers of brokering compute nodes.

Number of Subscriptions    Throughput (IOs/second), 32 Nodes    Throughput (IOs/second), 64 Nodes
64                         793                                  3500
128                        515                                  1900
256                        393                                  1450
512                        204                                  1050
1024                       104                                  550

Because the suspected overflow effect occurs only when the input rate
approaches the maximum brokering rate, and because the maximum brokering
rate is highly linear with respect to the number of computing nodes, we
believe it is feasible to predict when the buffer overflow will occur. As
such, the experimental observations in Figure 4 imply that it should be
straightforward to select the number of brokering compute nodes that
should be allocated in a cloud for our brokering prototype to successfully
process desired brokering throughputs.
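One way such a node-selection rule could look is sketched below. It assumes throughput is simply proportional to node count and adds a headroom factor so that the input rate stays below the maximum brokering rate; both the proportional model and the `headroom` parameter are our assumptions, not part of the reported method.

```python
import math


def nodes_needed(target_ios_per_sec: float,
                 ios_per_sec_per_node: float,
                 headroom: float = 0.8) -> int:
    """Estimate brokering nodes so the target rate uses at most `headroom` of capacity."""
    if ios_per_sec_per_node <= 0:
        raise ValueError("per-node rate must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable_rate = ios_per_sec_per_node * headroom
    return max(1, math.ceil(target_ios_per_sec / usable_rate))
```

For example, with a per-node rate of 56 IOs per second, a target of 1000 IOs per second needs 18 nodes with no headroom and 23 nodes when only 80% of capacity is used.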

C. Experiment Execution to Investigate Scalability from adding Subscriptions

As with the experiment set that analyzes the impact of adding more worker
nodes for brokering, we collected data on the maximum brokering throughput
as a function of the number of subscriptions for various brokering worker
node configurations. Table 2 presents the maximum brokering throughput in
numbers of IOs per second as a function of the number of brokering compute
nodes and subscriptions. This data is also shown as a graph in Figure 5.

Our experimental data indicates that the brokering throughput
is approximately inversely proportional to the number of subscriptions.
This behavior is expected, as it intuitively takes longer to broker
over more subscriptions. To investigate the relationship between
IO throughput and the number of subscriptions in our
brokering prototype, we plotted this relationship in a log-log
plot, shown in Figure 6.

We used standard linear regression to find best-fit lines
for the experimental data, as shown in Figure 6.
Lines in the log-log graph can be mapped back to the linear-linear
graph to identify the relationship between throughput and
the number of subscriptions. We start from the following equation,
which expresses the experimentally identified linear
relationship between throughput and the number of subscriptions
in the log-log domain:

$$\log_{10} y = m \log_{10} x + b$$

where

•  y represents throughput

•  x represents the number of subscriptions

•  m and b are parameters that determine the relationship
between throughput and subscriptions

Exponentiating both sides converts this equation into the
corresponding relationship between throughput and the number
of subscriptions in the linear-linear domain:

$$y = 10^{b} x^{m}$$
As can be seen in Figure 6, the slopes of the linear regression
lines for the experimental data are very close for the 32
node and 64 node cases. For the 32 brokering compute node
case the exponent -m is 0.89, and for the 64 brokering compute
node case the exponent -m is 0.91.
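The log-log regression procedure described above can be sketched as follows. This is a minimal illustration that applies ordinary least squares to the values listed in Table 2 and then maps the fitted line back to a power law. The function and variable names are our own. The slopes it yields come only from the five tabulated points per configuration and may differ from the exponents reported for the regression lines in Figure 6.

```python
import math

# Values from Table 2: maximum throughput (IOs/second) per subscription count.
SUBSCRIPTIONS = [64, 128, 256, 512, 1024]
THROUGHPUT = {
    32: [793, 515, 393, 204, 104],      # 32 brokering compute nodes
    64: [3500, 1900, 1450, 1050, 550],  # 64 brokering compute nodes
}


def fit_log_log(xs, ys):
    """Least-squares fit of log10(y) = m * log10(x) + b; returns (m, b)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(x, m, b):
    """Map the fitted line back to the linear-linear domain: y = 10^b * x^m."""
    return 10 ** b * x ** m


FITS = {nodes: fit_log_log(SUBSCRIPTIONS, ys) for nodes, ys in THROUGHPUT.items()}

for nodes, (m, b) in FITS.items():
    print(f"{nodes} nodes: m = {m:.3f}, b = {b:.3f}, "
          f"predicted throughput at 256 subscriptions = {predict(256, m, b):.0f}")
```

A negative slope m corresponds to throughput that falls as subscriptions are added, and the magnitude of m indicates how close the relationship is to strict inverse proportionality (m = -1).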

Our experimental results show that increases in the number
of brokering compute nodes offset increases in the number of
subscriptions to maintain throughput. Throughput grows linearly
with the number of brokering compute nodes, but sub-linearly
with the reciprocal of the number of subscriptions. As
such, if our prototype needs to maintain a level of throughput
despite an increase in the number of subscriptions, the number
of brokering compute nodes must grow by a proportionally
smaller factor than the number of subscriptions. This shows that
our prototype design can scale effectively with increases in
subscriptions by adding more compute nodes.
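The trade-off between nodes and subscriptions can be made concrete with a short worked calculation. As a simplifying assumption that combines the two observations above, suppose throughput y depends on the number of brokering compute nodes n and the number of subscriptions x as

$$y = c \, n \, x^{m}$$

where c is a constant and m is the (negative) log-log slope. Holding y fixed while the number of subscriptions grows from x_1 to x_2 requires

$$\frac{n_2}{n_1} = \left(\frac{x_2}{x_1}\right)^{-m}$$

With -m of roughly 0.9, as reported above, doubling the number of subscriptions would call for about

$$2^{0.9} \approx 1.87$$

times as many brokering compute nodes, slightly less than a doubling. This estimate is only as good as the assumed combined model and should not be read as a measured result.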

All of the brokering performance data presented and discussed
so far was derived from experiments conducted
on an Amazon cluster. We partially recreated our experiments
on an internal Eucalyptus cluster, which we could not
scale as large as the Amazon environment due to a lack of
computation resources. We recreated our subscription scalability
experiments on the relationship between throughput and the
number of subscriptions for 8 and 16 brokering compute node
setups in our private cloud. The results on our internal cluster
indicate a best-fit linear regression in the log-log domain with
-m = 0.97.

VI.  Conclusions  and  Future  Work 

Our research, prototype development, and evaluation have established
a firm basis for cloud-based IM systems. It has shown
that brokering of IOs for client requests can be architected and
implemented to work with emerging cloud-based streaming
platforms and that, in doing so, publish-subscribe operations
can be performed in the cloud at high speeds and at scale, both
in terms of number of published objects (throughput) and in
terms of number of registered subscriptions. We have shown
that there are several existing and emerging cloud-based technologies
that show the promise of being applied to IM service
concepts and that provide varying levels of benefit for IM operations
in the cloud.

Figure 6. IO Throughput as a Function of Number of Subscriptions with Best-fit Linear Regression Lines.

A next step is to design and develop cloud-based versions
of additional IM services to round out and complement our
brokering capability. Messaging services such as HornetQ [8]
and Kestrel [14] are promising candidates for supporting submission
and dissemination. A wide range of distributed databases,
e.g., Accumulo [3], Cassandra [4], and MongoDB [12],
could be leveraged to back archive and query services.